by KRISTIJAN BAKARIC

Introduction to the dataset that EDA will be carried on:

Last updated 03/11/2014

This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, borrower employment status, borrower credit history, and the latest payment information.

There are MANY variables in this data set and you are not expected to explore all of them. You should explore between 10-15 variables in your analysis. Expected time to analyze the dataset 15-30 hours

Since the dataset is to vast to explore in this excercise I will intimately examine only 10-15 variables and create a dataframe from the relevant selection.

This variable dictionary explains the variables in the subset of the data set that I will explore:

##                ListingKey LoanOriginalAmount BorrowerState Term LoanStatus
## 1 1021339766868145413AB3B               9425            CO   36  Completed
## 2 10273602499503308B223C1              10000            CO   36    Current
## 3 0EE9337825851032864889A               3001            GA   36  Completed
## 4 0EF5356002482715299901A              10000            GA   36    Current
## 5 0F023589499656230C5E3E2              15000            MN   36    Current
## 6 0F05359734824199381F61D              15000            NM   60    Current
##            ClosedDate BorrowerRate ListingCategory..numeric.    Occupation
## 1 2009-08-14 00:00:00       0.1580                         0         Other
## 2                <NA>       0.0920                         2  Professional
## 3 2009-12-17 00:00:00       0.2750                         0         Other
## 4                <NA>       0.0974                        16 Skilled Labor
## 5                <NA>       0.2085                         2     Executive
## 6                <NA>       0.1314                         1  Professional
##   EmploymentStatus IsBorrowerHomeowner    IncomeRange DebtToIncomeRatio
## 1    Self-employed                True $25,000-49,999              0.17
## 2         Employed               False $50,000-74,999              0.18
## 3    Not available               False  Not displayed              0.06
## 4         Employed                True $25,000-49,999              0.15
## 5         Employed                True      $100,000+              0.26
## 6         Employed                True      $100,000+              0.36
##   StateCapital LoanOriginationDate
## 1     COLORADO 2007-09-12 00:00:00
## 2     COLORADO 2014-03-03 00:00:00
## 3      GEORGIA 2007-01-17 00:00:00
## 4      GEORGIA 2012-11-01 00:00:00
## 5    MINNESOTA 2013-09-20 00:00:00
## 6   NEW MEXICO 2013-12-24 00:00:00

Univariate Plots Section

##                     variable q_zeros p_zeros  q_na  p_na q_inf p_inf
## 1                 ListingKey       0    0.00     0  0.00     0     0
## 2         LoanOriginalAmount       0    0.00     0  0.00     0     0
## 3              BorrowerState       0    0.00  5515  4.84     0     0
## 4                       Term       0    0.00     0  0.00     0     0
## 5                 LoanStatus       0    0.00     0  0.00     0     0
## 6                 ClosedDate       0    0.00 58848 51.65     0     0
## 7               BorrowerRate       8    0.01     0  0.00     0     0
## 8  ListingCategory..numeric.   16965   14.89     0  0.00     0     0
## 9                 Occupation       0    0.00  3588  3.15     0     0
## 10          EmploymentStatus       0    0.00  2255  1.98     0     0
## 11       IsBorrowerHomeowner       0    0.00     0  0.00     0     0
## 12               IncomeRange       0    0.00     0  0.00     0     0
## 13         DebtToIncomeRatio      19    0.02  8554  7.51     0     0
## 14              StateCapital       0    0.00  5897  5.18     0     0
## 15       LoanOriginationDate       0    0.00     0  0.00     0     0
##         type unique
## 1     factor 113066
## 2    integer   2468
## 3  character     51
## 4    integer      3
## 5     factor     12
## 6     factor   2802
## 7    numeric   2294
## 8    integer     21
## 9     factor     67
## 10    factor      8
## 11    factor      2
## 12    factor      8
## 13   numeric   1207
## 14    factor     50
## 15    factor   1873

From the variables I will explore, BorrowerState, Occupation, EmploymentStatus, DebtToIncomeRatio and State Capital have percentage of missing observations in magnitude of up to 7.5%. CLosedDate for the loans has the biggest loss of observations, 51.6%

##                    ListingKey     LoanOriginalAmount BorrowerState     
##  17A93590655669644DB4C06:     6   Min.   : 1000      Length:113937     
##  349D3587495831350F0F648:     4   1st Qu.: 4000      Class :character  
##  47C1359638497431975670B:     4   Median : 6500      Mode  :character  
##  8474358854651984137201C:     4   Mean   : 8337                        
##  DE8535960513435199406CE:     4   3rd Qu.:12000                        
##  04C13599434217079754AEE:     3   Max.   :35000                        
##  (Other)                :113912                                        
##       Term                       LoanStatus                  ClosedDate   
##  Min.   :12.00   Current              :56576   2014-03-04 00:00:00:  105  
##  1st Qu.:36.00   Completed            :38074   2014-02-19 00:00:00:  100  
##  Median :36.00   Chargedoff           :11992   2014-02-11 00:00:00:   92  
##  Mean   :40.83   Defaulted            : 5018   2012-10-30 00:00:00:   81  
##  3rd Qu.:36.00   Past Due (1-15 days) :  806   2013-02-26 00:00:00:   78  
##  Max.   :60.00   Past Due (31-60 days):  363   (Other)            :54633  
##                  (Other)              : 1108   NA's               :58848  
##   BorrowerRate    ListingCategory..numeric.               Occupation   
##  Min.   :0.0000   Min.   : 0.000            Other              :28617  
##  1st Qu.:0.1340   1st Qu.: 1.000            Professional       :13628  
##  Median :0.1840   Median : 1.000            Computer Programmer: 4478  
##  Mean   :0.1928   Mean   : 2.774            Executive          : 4311  
##  3rd Qu.:0.2500   3rd Qu.: 3.000            Teacher            : 3759  
##  Max.   :0.4975   Max.   :20.000            (Other)            :55556  
##                                             NA's               : 3588  
##       EmploymentStatus IsBorrowerHomeowner         IncomeRange   
##  Employed     :67322   False:56459         $25,000-49,999:32192  
##  Full-time    :26355   True :57478         $50,000-74,999:31050  
##  Self-employed: 6134                       $100,000+     :17337  
##  Not available: 5347                       $75,000-99,999:16916  
##  Other        : 3806                       Not displayed : 7741  
##  (Other)      : 2718                       $1-24,999     : 7274  
##  NA's         : 2255                       (Other)       : 1427  
##  DebtToIncomeRatio     StateCapital            LoanOriginationDate
##  Min.   : 0.000    CALIFORNIA:14717   2014-01-22 00:00:00:   491  
##  1st Qu.: 0.140    TEXAS     : 6842   2013-11-13 00:00:00:   490  
##  Median : 0.220    NEW YORK  : 6729   2014-02-19 00:00:00:   439  
##  Mean   : 0.276    FLORIDA   : 6720   2013-10-16 00:00:00:   434  
##  3rd Qu.: 0.320    ILLINOIS  : 5921   2014-01-28 00:00:00:   339  
##  Max.   :10.010    (Other)   :67111   2013-09-24 00:00:00:   316  
##  NA's   :8554      NA's      : 5897   (Other)            :111428
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1000    4000    6500    8337   12000   35000

Distribution of Original loan amounts is right tailed and it has 5 distinct peaks that needs to be investigated. Mean loan amount is 8337$

##              Cancelled             Chargedoff              Completed 
##                      5                  11992                  38074 
##                Current              Defaulted FinalPaymentInProgress 
##                  56576                   5018                    205 
##   Past Due (>120 days)   Past Due (1-15 days)  Past Due (16-30 days) 
##                     16                    806                    265 
##  Past Due (31-60 days)  Past Due (61-90 days) Past Due (91-120 days) 
##                    363                    313                    304

Largest count of loans is still taged as current. From the past due loans the largest cathegory is past due between 1-15 days.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.140   0.220   0.276   0.320  10.010    8554

##   count ratio_4_times
## 1   367     0.3482535

Mean debth to income ratio is 0.28 and maximum is 10.010. The count to loans that have ratio more than 4 times compared to all loans (minus NA s) is 0.35%.

Length of loans are distributed in 3 cathegories, 1 3 and 5 year loans where count is 3>5>1 years.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.1340  0.1840  0.1928  0.2500  0.4975

Distribution of Borrowers interest loans is quite narrow with a mean at 0.19%. Distribution is a slightly skewed to the left. The values it self seem a bit suspiceous. I would expect to see interest rates at least one magnitude higher from 0 to 5 percent so it could be an issue with the original dataset. I real case scenario I would have a direct dialogue with a domain expert in the topic:)

##             $0      $1-24,999      $100,000+ $25,000-49,999 $50,000-74,999 
##            621           7274          17337          32192          31050 
## $75,000-99,999  Not displayed   Not employed 
##          16916           7741            806

The highest income range for the borrowers is in two cathegories which in total range between 25 000 and 74 999 us dollars.

##      Employed     Full-time Not available  Not employed         Other 
##         67322         26355          5347           835          3806 
##     Part-time       Retired Self-employed          NA's 
##          1088           795          6134          2255
##  [1] "ListingKey"                "LoanOriginalAmount"       
##  [3] "BorrowerState"             "Term"                     
##  [5] "LoanStatus"                "ClosedDate"               
##  [7] "BorrowerRate"              "ListingCategory..numeric."
##  [9] "Occupation"                "EmploymentStatus"         
## [11] "IsBorrowerHomeowner"       "IncomeRange"              
## [13] "DebtToIncomeRatio"         "StateCapital"             
## [15] "LoanOriginationDate"

90% of the borrowers that take loans are either Employed, Full-time eployed or Self-Employed

Top 5 occupations of borrowers were as following: Professional, Computer programmer, Executive, Teacher and Administrative assistant. Bottom of the list is dominated by students.

## False  True 
## 56459 57478

In the entire dataset number of homeowners vs the ones that do not own a home is very similar (False- 56459 vs True - 57478).

Distribution of years when the loan were originated. Note the peak in 2013.

Most common listing cathegory for a loan are Debt consolidation, Home improvement and Business, if we exclude Not available and other.

Univariate Analysis

What is the structure of your dataset?

Subset of the data consists of 15 variables and is spread in 113937 rows. Data completeness report gave us an insight that most of the observations have values apart from the ClosedDate which is missing in 50% of the cases, and that is justifiable since it applies only to Cancelled, Completed, Chargedoff and Defaulted loan statuses.

I didn t manage to understand the root cause for the 871 of duplicated ListingKeys.

What is/are the main feature(s) of interest in your dataset?

Main feature of interest in the dataset is the amount of loan.

What other features in the dataset do you think will help support your

 Did you create any new variables from existing variables in the dataset?

Supporting features that I brought in externally are related to mapping out the US state codes into a human language. I did the same with the listing cathegories but remapping was done from an additional explanatory excel spreadsheet. Spatial context - geometry of the states would be of hel to see spatial distribution of loans and loan related variables.

Of the features you investigated, were there any unusual distributions?

 Did you perform any operations on the data to tidy, adjust, or change the

form  of the data? If so, why did you do this?

Form of the data was not changed, i.e. data is in tidy format.

Bivariate Plots Section

There is a positive relationship when comparing median as a metric between Loan amounts and 1, 3 and 5 years as lenght of loan. Higher to Term, higher the median loan amount.

Top 3 highest loans on average are targeting Debt consolidation, baby and adoption,and business, on the other side of the spectrum top 3 lowest loans on average are targeting student use, personal loans and buying an automobile.

Top 5 on average highest loans are initiated by folowing occupation groups: Judges, Pharmacists and Doctors (dataset filtered on more than 10 000USD loans).

What I expected to see when plotting Original Loan amount against depth to to income ratio is more negative correlation. What is clear in this plot iz that if we exclude >10 debt to income ratio, is that in a range of 1 to 10 of debt to income ratio there is very few data point in the space between 10 000 and 30 000 when comparing it to below 10 000 where the majority of the points lay i.e. burrowers with higher DTI ratio take lower loans.

Highest count of initiated loans in the datacomes from Professionals in the income groups: 50,000-74,999USD and 100,000+ USD (if we exclude combinations defined as Other).

It is evident that median BorrowerRate has been increasing over the years with a peak in 2011 and dropping again towards 2014. In addition there is a trend in narrowing IQR from 2010 onwards.

Medians for loan original amounts have been increasing since 2009 with a peak in 2014. In addition, there has almost not been outliers in a range over 30 000USD until year 2013 and 2014.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the

investigation. How did the feature(s) of interest vary with other

features in the dataset?

Main obesrvations:

  • positive relationship between increase of loan amounts medians and increase of length of loans - not a surprise

  • Medians for loan original amounts have been increasing since 2009 with a peak in 2014. In addition, there has almost no outliers in a range over 30 000USD until year 2013 and 2014.

  • It is evident that median BorrowerRate has been increasing over the years with a peak in 2011 and dropping again towards 2014. In addition there is a trend in narrowing IQR from 2010 onwards.

  • Top 5 on average highest loans are initiated by folowing occupation groups: Judges, Pharmacists and Doctors (dataset filtered on more than 10 000USD loans).

  • Top 3 highest loans on average are targeting Debt consolidation, baby and adoption and business, on the other side of the spectrum top 3 lowest loans on average are targeting student use, personal loans and buying an automobile.

Did you observe any interesting relationships between the other features

(not the main feature(s) of interest)?

Highest count of initiated loans in the datacomes from Professionals in the income groups: 50,000-74,999USD and 100,000+ USD (if we exclude combinations defined as Other)

What was the strongest relationship you found?

THere is no in particular strong relationship found appart from the obvious ones which relate to OriginalLoanAmounts and levels of income, higher paid professions, increase in term length. I would benefit of having more in depth domain knowledge in order to scrutinize some of the more intricate relationships.

Multivariate Plots Section

There is a lot of information in this graph which summarizes mean loans grouped per year faceted per Occupation and colored by a flag if borrower is a homeowner or not. What can be seen at a first glance is that means are either equal or bigger if a borrower is a homeowner. Moreover, means are increasing in general regardless of occupation over the period of last 5 years. Horisonal line on the graph is highlighting mean of 5000USD loans and by focusing on points below that line we can see that they belong to a Student groups which doesn’t seem to change over time significantly and colors are associated to a flag - is not a homeowner.

Along similar lines, if a borrower is a homeowner, he or she will have on average similar or higher loans than if a borrower is not a homeowner. It is interesting to see that some of the ListingCathegories are constrained to last 4 years and some are more expanding over almost entire time span of a dataset, like debt consolidation, home improvements, businesses and automobiles. In addition, personal loans and student use is absent in last 4-6 years.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the

. Were there features that strengthened each other

in terms of at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

There is a lot of information in two graphs in multivariate part which summarizes mean loans grouped per year faceted per Occupation and colored by a flag if borrower is a homeowner or not. What can be seen at a first glance is that means are either equal or bigger if a borrower is a homeowner. Moreover, means are increasing in general regardless of occupation over the period of last 5 years. Horisonal line on the graph is highlighting mean of 5000USD loans and by focusing on points below that line we can see that they belong to a Student groups which doesn’t seem to change over time significantly and colors are associated to a flag - is not a homeowner.

Along similar lines, if a borrower is a homeowner, he or she will have on average similar or higher loans than if a borrower is not a homeowner. It is interesting to see that some of the ListingCathegories are constrained to last 4 years and some are more expanding over almost entire time span of a dataset, like debt consolidation, home improvements, businesses and automobiles. In addition, personal loans and student use is absent in last 4-6 years.


Final Plots and Summary

Plot One

Description One

Main feature of my interest was the maginute of original loan amounts Histogram shows count of the magnitude or the original loan amounts that borrowers have taken. What is evident from this plot is the skewness of the distribution with the right tail. In addition, There are several local peaks breaking the ideally right skewed distribution.

Plot Two

Description Two

Plot two is focusing on showing a relationship of distributons of initial loan amounts vs year the loan is initiated. Medians for loan original amounts have been increasing since 2009 with a peak in 2014. In addition, there has almost absence of the outliers in a range over 30 000USD until year 2013 and 2014. All of the distributions are skewed.

Plot Three

Description Three

Plot three carries a lot of information which enriches the plots from first and second plot with additional variables to see multivariate relationships with the main feature of interest - average original loan amounts per year, faceted based on the listing cathegory and colored by a variable which says if the borrower is already a homeowner or not.

What can be deduced is the following, if a borrower is a homeowner, he or she will have on average similar or higher loans than if a borrower is not a homeowner. It is interesting to see that some of the ListingCathegories are constrained to last 4 years and some are more expanding over almost entire time span of a dataset, like debt consolidation, home improvements, businesses and automobiles. In addition, personal loans and student use are completely absent in last 4-6 years.


Reflection

Exploration of dataset was challenging because firstly due to the time constraints I had to pick and choose 10-15 variables from 81 of them which was not easy not having extensive domain knowledge on the topic.

Univariate analyses gave me room to explore distributions of most important variables in the dataset.

Bivariate analyses pointed me in directions where should I focus when exploring two variables at the same time and increased my curiousity to explore timeseries change.

Multivariate analyses in fully made me able to explore the relationships between time, magnitude of loans and two additional variables like home ownership and listing type in the same visualisation - very powerful tools.

Some of the strugles I encountered were lack of domain knowledge in the topic and wrangling the datasets, preparing the data so it is in right format for plotting to happen in the right manner.

What went pretty well is generally how I handeled the who EDA process:)

As a future insight I would like to explore the dataset more in spatial domain and also start including more relevant variables from the full dataset.

In addition I would like to spend more time into quantifying potential relationships that were detected during the visual inspections of plots.